Improve performance of drop table in Iceberg connector #15981
Conversation
Throwing RuntimeException is different from the interface's expectation. It would be nice to throw BulkDeletionFailureException or leave a comment.
BulkDeletionFailureException requires the number of files that failed to delete. Currently fileSystem.deleteFiles does not give such information. I will prepare a PR for the same.
Force-pushed from d5565c6 to 1f921c2
I would implement this directly in ForwardingFileIo, not as a subclass.
(that would probably also satisfy https://github.com/trinodb/trino/pull/15981/files#r1097092879, right?)
cc @electrum
what is org.apache.iceberg.io.BulkDeletionFailureException#numberFailedObjects needed for?
if we don't need it, we can skip having a TODO
SupportsBulkOperations::deleteFiles throws a BulkDeletionFailureException. BulkDeletionFailureException is constructed with an integer value, numberFailedObjects, which is the number of files that failed to delete.
SupportsBulkOperations::deleteFiles throws BulkDeletionFailureException exception.
I know.
BTW it can also throw any unchecked exception, obviously.
numberFailedObjects which means the number of files that failed to delete.
I figured this from the name :)
How does Iceberg lib use that information?
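For context, the dispatch under discussion — Iceberg's table cleanup taking a bulk path when the FileIO supports it, and falling back to per-file deletes otherwise — can be sketched roughly as follows. All names here are simplified stand-ins, not the actual Iceberg interfaces:

```java
import java.util.List;

// Hypothetical mirror of the pattern: a plain per-file delete interface,
// and a bulk-capable subtype that table cleanup prefers when available.
interface FileIo {
    void deleteFile(String path);
}

interface SupportsBulkOperations extends FileIo {
    void deleteFiles(Iterable<String> paths);
}

class TableCleaner {
    static void deleteAll(FileIo io, List<String> paths) {
        if (io instanceof SupportsBulkOperations) {
            // One batched call for the whole list
            ((SupportsBulkOperations) io).deleteFiles(paths);
        }
        else {
            // One call per file otherwise
            paths.forEach(io::deleteFile);
        }
    }
}
```

With many thousands of data files, the per-file branch pays one round trip per file, which is where the slowdown described in this PR comes from.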
Force-pushed from 17b4302 to 30eff29
Addressed comments. CI is failing due to a flaky test
This isn’t thrown by the method. Should we be using this instead?
we cannot throw BulkDeletionFailureException because we cannot fill BulkDeletionFailureException#numberFailedObjects
see #15981 (comment) for more
I don't think including all file paths to the error message is a good idea in case of large deletes. It may include 1,000 files.
@ebyhr you convinced me, let's not make the message that long. OTOH, it could be problematic if we do not include any paths in the message. It can make identifying problems harder (e.g. which bucket did we try to access?).
I pushed a change to display only the first 5 paths followed by an ellipsis (...).
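The truncation idea is straightforward; a self-contained sketch (names like describe are illustrative, not the PR's actual helper):

```java
import java.util.List;
import java.util.stream.Collectors;

class DeleteErrors {
    // Build an error-message fragment listing at most `limit` failed paths,
    // appending an ellipsis when more were omitted.
    static String describe(List<String> failedPaths, int limit) {
        String shown = failedPaths.stream()
                .limit(limit)
                .collect(Collectors.joining(", "));
        return failedPaths.size() > limit
                ? shown + ", ..."
                : shown;
    }
}
```

This keeps the message bounded for large deletes while still surfacing enough of a path to identify the bucket or directory involved.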
Force-pushed from b986679 to 659a3b0
Force-pushed from 659a3b0 to 3310ec6
Description
While working on Iceberg small-files benchmarking, I noticed that dropping a table is very slow when the table has a large number of files. The main cause is dropTableData, which deletes the files concurrently, one by one, when the fileIO is not an instance of SupportsBulkOperations. We can delete files in bulk if fileIO is an instance of SupportsBulkOperations, so this change implements SupportsBulkOperations instead of FileIO in ForwardingFileIo.java. I tested it on my local machine: dropping a table on the Glue metastore with approximately 10K data files took ~26 minutes without this change, whereas with this change it took only ~20 seconds.
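The forwarding idea in the description — exposing a bulk-delete entry point that delegates to the underlying file system's batched delete — can be sketched like this. The names (BulkFileSystem, ForwardingIo) are hypothetical stand-ins, not Trino's actual classes:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical file system exposing a batched delete.
interface BulkFileSystem {
    void deleteFiles(List<String> paths) throws IOException;
}

class ForwardingIo {
    private final BulkFileSystem fileSystem;

    ForwardingIo(BulkFileSystem fileSystem) {
        this.fileSystem = fileSystem;
    }

    // Bulk entry point: a single batched call instead of one call per file.
    void deleteFiles(List<String> paths) {
        try {
            fileSystem.deleteFiles(paths);
        }
        catch (IOException e) {
            // The underlying API does not report how many files failed,
            // so (per the discussion above) we cannot construct
            // BulkDeletionFailureException here; rethrow unchecked instead.
            throw new UncheckedIOException("Failed to delete files", e);
        }
    }
}
```

The exception translation reflects the constraint raised in the review: without a count of failed objects from the file system, the bulk path has to surface failures as a generic unchecked exception rather than BulkDeletionFailureException.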
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text: